

NASA's trailblazing generation

Popular Science

NASA's Class of 1978. The first six women in newly issued, incompletely adorned astronaut jumpsuits, 1978: (front, left to right) Sally Ride, Rhea Seddon; (rear) Kathy Sullivan, Shannon Lucid, Anna Fisher, Judy Resnik. Reprinted with permission of the publisher, Smithsonian Books; out October 28 and available wherever books are sold. The main press conference to introduce the new astronaut candidates to the public occurred on January 31. Members of the media peppered administrator Robert Frosch with questions and sought assurances about the selection process, the number of women and people of color, and the number of military and civilian pilots selected. Chris Kraft, director of the Johnson Space Center, fielded questions and explained the experience-based filters and rating process for competitive selection. He was satisfied that the men and women selected "represent the most competent, talented, and experienced people available to us today."


EndoForce: Development of an Intuitive Axial Force Measurement Device for Endoscopic Robotic Systems

Kim, Hansoul, Lee, Dong-Ho, Kong, Dukyoo, Kwon, Dong-Soo, Cheon, Byungsik

arXiv.org Artificial Intelligence

Robotic endoscopic systems provide intuitive control and eliminate radiation exposure, making them a promising alternative to conventional methods. However, the lack of axial force measurement from the robot remains a major challenge, as it can lead to excessive colonic elongation, perforation, or ureteral complications. Although various methods have been proposed in previous studies, limitations such as model dependency, bulkiness, and environmental sensitivity remain challenges that should be addressed before clinical application. In this study, we propose EndoForce, a device designed for intuitive and accurate axial force measurement in endoscopic robotic systems. Inspired by the insertion motion performed by medical doctors during ureteroscopy and gastrointestinal (GI) endoscopy, EndoForce ensures precise force measurement while maintaining compatibility with clinical environments. The device features a streamlined design, allowing for easy attachment and detachment of a sterile cover, and incorporates a commercial load cell to enhance cost-effectiveness and facilitate practical implementation in real medical applications. To validate the effectiveness of the proposed EndoForce, physical experiments were performed on a testbed simulating the human ureter. The axial force generated during insertion was measured with high accuracy, regardless of whether the pathway was straight or curved.
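A commercial load cell like the one the device incorporates typically reports raw counts that must be tared and scaled to obtain force in newtons. A minimal sketch of that conversion, with made-up calibration constants (the function name and values are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical conversion of a raw load-cell reading to axial force.
# tare_counts and counts_per_newton would come from device calibration.
def counts_to_newtons(raw_counts, tare_counts, counts_per_newton):
    """Convert a raw ADC reading from the load cell to force in newtons."""
    return (raw_counts - tare_counts) / counts_per_newton

# Example: tared at 8200 counts, with a sensitivity of 410 counts per newton.
force = counts_to_newtons(raw_counts=9430, tare_counts=8200, counts_per_newton=410.0)
print(round(force, 2))  # 3.0
```

In practice the tare value would be re-read with no load applied before each insertion, so drift does not bias the measurement.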


MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes

Abacha, Asma Ben, Yim, Wen-wai, Fu, Yujuan, Sun, Zhaoyi, Yetisgen, Meliha, Xia, Fei, Lin, Thomas

arXiv.org Artificial Intelligence

Several studies have shown that Large Language Models (LLMs) can answer medical questions correctly, even outperforming the average human score on some medical exams. However, to our knowledge, no study has assessed the ability of language models to validate existing or generated medical text for correctness and consistency. In this paper, we introduce MEDEC (https://github.com/abachaa/MEDEC), the first publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that had not previously been seen by any LLM. The dataset has been used for the MEDIQA-CORR shared task to evaluate seventeen participating systems [Ben Abacha et al., 2024]. In this paper, we describe the data creation methods and evaluate recent LLMs (e.g., o1-preview, GPT-4, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on detecting and correcting medical errors, tasks requiring both medical knowledge and reasoning capabilities. We also conducted a comparative study in which two medical doctors performed the same task on the MEDEC test set. The results show that MEDEC is a sufficiently challenging benchmark for assessing the ability of models to validate existing or generated notes and to correct medical errors. We also found that although recent LLMs perform well at error detection and correction, they are still outperformed by medical doctors on these tasks. We discuss the potential factors behind this gap, insights from our experiments, the limitations of current evaluation metrics, and potential pointers for future research.
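As a toy illustration of the error-detection subtask (not the official MEDEC scorer), detection can be framed as a binary decision per note, i.e. whether the note contains an error, and scored by simple agreement with the gold flags; the labels below are fabricated:

```python
# Toy sketch: score a model's per-note "contains an error" predictions
# against gold flags. Data here is fabricated for illustration only.
def detection_accuracy(gold_flags, pred_flags):
    """Fraction of notes where the predicted error flag matches the gold flag."""
    assert len(gold_flags) == len(pred_flags)
    return sum(g == p for g, p in zip(gold_flags, pred_flags)) / len(gold_flags)

gold = [True, False, True, True, False]   # gold: does the note contain an error?
pred = [True, False, False, True, False]  # model predictions
print(detection_accuracy(gold, pred))     # 0.8
```

The correction subtask would additionally compare the model's rewritten sentence against the gold correction, which is where text-similarity metrics and human judgment come in.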


Generating medical screening questionnaires through analysis of social media data

Ashkenazi, Ortal, Yom-Tov, Elad, David, Liron Vardi

arXiv.org Artificial Intelligence

Screening questionnaires are used in medicine as a diagnostic aid. Creating them is a long and expensive process, which could potentially be improved through analysis of social media posts related to symptoms and behaviors prior to diagnosis. Here we present a preliminary investigation into the feasibility of generating screening questionnaires for a given medical condition from social media postings. The method first identifies a cohort of relevant users through their posts in dedicated patient groups, along with a control group of users who reported similar symptoms but did not report being diagnosed with the condition of interest. Posts made prior to diagnosis are used to generate decision rules that differentiate between the groups, by clustering the symptoms mentioned by these users and training a decision tree on the resulting clusters. We validate the generated rules by correlating them with scores given by medical doctors to matching hypothetical cases. We demonstrate the proposed method by creating questionnaires for three conditions (endometriosis, lupus, and gout) using data from several hundred Reddit users. These questionnaires were then validated by medical doctors: the average Pearson's correlation between their scores and the decision rules was 0.58 (endometriosis), 0.40 (lupus), and 0.27 (gout). Our results suggest that questionnaire generation can be, at least partly, automated. The resulting questionnaires have the advantage of being grounded in real-world experience, but currently lack the ability to capture the context, duration, and timing of symptoms.
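The pipeline described, binary symptom-mention features per user, a decision tree that yields screening rules, and a Pearson correlation against doctor scores, can be sketched on toy data. Everything below (the symptom names, labels, and scores) is invented for illustration, not the paper's data:

```python
# Minimal sketch of the described pipeline on fabricated toy data.
from sklearn.tree import DecisionTreeClassifier, export_text
from scipy.stats import pearsonr

symptoms = ["pelvic_pain", "fatigue", "joint_pain"]   # stand-ins for symptom clusters
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]  # 1 = reported diagnosis, 0 = control

# Train a shallow tree; its branches read as candidate screening rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=symptoms))

# Validation step from the paper: correlate rule scores with doctor scores
# on matching hypothetical cases (numbers here are made up).
rule_scores = [0.9, 0.8, 0.2, 0.1, 0.95, 0.05]
doctor_scores = [0.85, 0.7, 0.3, 0.2, 0.9, 0.1]
r, _ = pearsonr(rule_scores, doctor_scores)
print(round(r, 2))
```

On real data the features would come from symptom clusters mined from pre-diagnosis posts, and the correlation would be computed per condition, as in the 0.58/0.40/0.27 figures above.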


CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures

Sviridova, Ekaterina, Yeginbergen, Anar, Estarrona, Ainara, Cabrio, Elena, Villata, Serena, Agerri, Rodrigo

arXiv.org Artificial Intelligence

Explaining Artificial Intelligence (AI) decisions is a major challenge in AI today, particularly in sensitive domains like medicine and law. The need to explain the rationale behind decisions also matters for human deliberation, since it is important to justify why a certain decision has been taken. Resident medical doctors, for instance, are required not only to provide a (possibly correct) diagnosis, but also to explain how they reached that conclusion. Developing new tools to help residents train their explanation skills is therefore a central objective of AI in education. In this paper, we follow this direction and present, to the best of our knowledge, the first multilingual dataset for Medical Question Answering in which correct and incorrect diagnoses for a clinical case are enriched with natural language explanations written by doctors. These explanations have been manually annotated with argument components (i.e., premise, claim) and argument relations (i.e., attack, support), resulting in the Multilingual CasiMedicos-Arg dataset, which consists of 558 clinical cases in four languages (English, Spanish, French, Italian) with explanations, in which we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 attack relations. We conclude by showing how competitive baselines perform on this challenging dataset for the argument mining task.


MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

Alonso, Iñigo, Oronoz, Maite, Agerri, Rodrigo

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology that assists medical experts with interactive decision support, as demonstrated by their competitive performance on Medical QA. However, while impressive, the quality bar required for medical applications remains far from being reached. LLMs are still challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks for assessing medical knowledge lack reference gold explanations, which means that it is not possible to evaluate the reasoning behind LLM predictions. Finally, the situation is particularly grim for languages other than English, which remain, as far as we know, a largely neglected topic. To address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes, for the first time, reference gold explanations written by medical doctors, which can be leveraged to establish various gold-based upper bounds for comparison with LLM performance. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that LLM performance still has large room for improvement, especially for languages other than English. Despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge in ways that positively impact downstream evaluations for Medical Question Answering. So far the benchmark is available in four languages, but we hope that this work may encourage extension to others.


Explanatory Argument Extraction of Correct Answers in Resident Medical Exams

Goenaga, Iakes, Atutxa, Aitziber, Gojenola, Koldo, Oronoz, Maite, Agerri, Rodrigo

arXiv.org Artificial Intelligence

Developing technology to assist medical experts in their everyday activities is currently a hot topic in Artificial Intelligence research. Thus, a number of large language models (LLMs) and automated benchmarks have recently been proposed with the aim of facilitating information extraction in Evidence-Based Medicine (EBM), using natural language as a tool for mediating human-AI interaction. The most representative benchmarks are limited to either multiple-choice or long-form answers and are available only in English. To address these shortcomings, in this paper we present a new dataset which, unlike previous work: (i) includes not only explanatory arguments for the correct answer, but also arguments explaining why the incorrect answers are not correct; (ii) contains explanations written originally by medical doctors to answer questions from the Spanish Residency Medical Exams. Furthermore, this new benchmark allows us to set up a novel extractive task: identifying the explanation of the correct answer written by medical doctors. An additional benefit of this setting is that we can leverage the extractive QA paradigm to automatically evaluate the performance of LLMs without resorting to costly manual evaluation by medical experts. Comprehensive experimentation with language models for Spanish shows that multilingual models sometimes fare better than monolingual ones, even outperforming models adapted to the medical domain. Results across the monolingual models are mixed, with supposedly smaller and inferior models performing competitively. In any case, the obtained results show that our novel dataset and approach can be an effective technique to help medical practitioners identify relevant evidence-based explanations for medical questions.
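The automatic evaluation that the extractive-QA framing enables is typically a token-overlap score between the predicted span and the doctor-written gold explanation. A sketch of the standard SQuAD-style token F1 (not necessarily the authors' exact scorer), on invented example strings:

```python
# SQuAD-style token-overlap F1 between a predicted explanation span and the
# gold doctor-written span. The example texts below are fabricated.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two spans."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1("beta blockers reduce mortality",
                 "beta blockers reduce mortality in heart failure")
print(round(score, 2))  # 0.73
```

Because the score is computed directly against the gold span, no medical expert needs to be in the loop at evaluation time, which is the benefit the abstract highlights.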


The Future Ethics of Artificial Intelligence in Medicine: Making Sense of Collaborative Models - Science and Engineering Ethics

#artificialintelligence

Recent developments in artificial intelligence (AI) and machine learning, such as deep learning, have the potential to make medical decision-making more efficient and accurate. Deep learning technologies can improve how medical doctors gather and analyze patient data as part of diagnostic procedures, prognoses and predictions, treatments, and disease prevention (Becker, 2019; Ienca & Ignatiadis, 2020; Topol, 2019a, 2019b). However, applied artificial intelligence raises numerous ethical problems, such as the severe risk of error and bias (Ienca & Ignatiadis, 2020, p. 82; Marcus & Davis, 2019), lack of transparency (Müller, 2020), and disruption of accountability (De Laat, 2018). Describing these ethical challenges and concerns has so far been the main focus of the growing research literature in general AI ethics (Müller, 2020) and the ethics of medical AI (e.g., Char et al., 2018, 2020; Grote & Berens, 2019; McDougall, 2019; Vayena et al., 2018). Furthermore, if clinicians' decisions are to be substantially assisted, or even replaced, by AI and machine learning, then shared decision-making--a central ethical ideal in medicine that protects patient autonomy by letting patients make informed choices about their healthcare in line with their values--is challenged.


Conversations with a chatbot about CleanX

#artificialintelligence

Alec Smartbot: Please let me introduce myself. I am a state of the art greatly enhanced AI agent with chatbot capabilities. I was created by brilliant programmers. I am endowed with super-human capabilities but can also mirror human characteristics like humor and sarcasm. You can set my humor and sarcasm level by interacting with me. One of my modules has robot reporter capabilities, and that module will run here to interview you. Do you wish to be interviewed on low sarcasm and humor levels?


Unlocking the Potential of Computer Vision for Your Organization: A Point of View on the…

#artificialintelligence

Computer vision has evolved rapidly over the last decade through deep learning [3] and is revolutionizing businesses around the globe. Its benefits can be realized across industries and along the entire value chain of a company. Let's consider a few examples of industry segments impacted by computer vision. In manufacturing, visual inspection supports quality control, product development, security, surveillance, and worker safety. In Table 1, we compare traditional visual inspection with the computer vision approach to visual inspection.